# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(readxl) # Read Excel Files
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax)
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(patchwork) # The Composer of Plots
# # ML -------------------------------------
# library(tidymodels) # Easily Install and Load the 'Tidymodels' Packages
# library(textrecipes) # Extra 'Recipes' for Text Processing
# # TEXT ANALYTICS -------------------------------------
# library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
# library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
# library(cleanNLP) # A Tidy Data Model for Natural Language Processing WB Project PDO text EDA
Work in progress
TO DO [COMPL🟠🟡] list
- check if the peaks in sector words correspond to peak in sector (the feature)
- add vertical lines with relevant WDR over the years
- do bigram plots on “climate change” and “climate resilient”
Set up
Set up ggplot2 theme
—————————————————————————-
Data sources
The data used in this analysis comes from the World Bank’s Projects & Operations database.
Since some pre-processing steps are computationally expensive, I did that in a separate step (analysis/_01a_WB_project_pdo_prep.qmd), WHERE:
-
Retrieved manually ALL WB projects (22,569) approved between FY 1947 and 2026 as of 31/08/2024 using simply the
Excel buttonon this page WBG Projects- Of these, approximately 50% (11,322 projects) had a “viable” PDO text in the dataset (i.e., not blank or labeled as “TBD”, etc.).
- There are no Project Development Objectives available in projects approved before FY2001
- However, other than approval year, based on some tests on available projects’ features, PDO texts’ missingness seems to happen at random.
Dropped from the analysis: projects with no PDO text (including those approved before FY2001), and projects with missing status, FY of approval –> 8,811 usable projects selected.
-
Split the dataset into training / validation / test subsets (proportional to FY and regional distribution).
- Here I work only on training set (~50% of usable ones, i.e. 4,403 PDOs).
Cleaned the data (parse dates, recode variables, etc.), fix typos, unwanted special characters, and other unimportant issues in PDOs.
Obtained PoS tagging + tokenization with
cleanNLPpackage (functionscnlp_init_udpipe()+cnlp_annotate()) and savedprojs_train_t(cleaned train dataset).
cleanNLP package
cleanNLP supports multiple backends for processing text, such as CoreNLP, spaCy, udpipe, and stanza. Each of these backends has different capabilities and might require different initialization procedures.
-
CoreNLP~ powerful Java-based NLP toolkit developed by Stanford, which includes many linguistic tools like tokenization, part-of-speech tagging, and named entity recognition.- ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Java with
cnlp_install_corenlp()which installs the Java JAR files and models)
- ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Java with
-
spaCy~ fast and modern NLP library written in Python. It provides advanced features like dependency parsing, named entity recognition, and tokenization.- ❕❗️ NEEDS EXTERNAL INSTALLATION (fmst be installed in Python (with
spacy_install()which installs bothspaCyand necessary Python dependencies) and thespacyrR package must be installed to interface with it.
- ❕❗️ NEEDS EXTERNAL INSTALLATION (fmst be installed in Python (with
-
udpipe~ R package that provides bindings to theUDPipeNLP toolkit. Fast, lightweight and language-agnostic NLP library for tokenization, part-of-speech tagging, lemmatization, and dependency parsing. -
stanza~ another modern NLP library from Stanford, similar to CoreNLP but built on PyTorch and supports over 66 languages…
when you initialize a back-end (like CoreNLP) in cleanNLP, it stays active for the entire session unless you reinitialize or explicitly change it.
# ---- 1) Initialize the CoreNLP backend
library(cleanNLP) # A Tidy Data Model for Natural Language Processing # A Tidy Data Model for Natural Language Processing # A Tidy Data Model for Natural Language Processing
cnlp_init_corenlp()
# If you want to specify a language or model path:
cnlp_init_corenlp(language = "en",
# model_path = "/path/to/corenlp-models"
)
# ---- 2) Initialize the spaCy backend
library(cleanNLP) # A Tidy Data Model for Natural Language Processing # A Tidy Data Model for Natural Language Processing # A Tidy Data Model for Natural Language Processing
library(spacyr) # Wrapper to the 'spaCy' 'NLP' Library
# Initialize spaCy in cleanNLP
cnlp_init_spacy()
# Optional: specify language model
cnlp_init_spacy(model_name = "en_core_web_sm")
# ---- 3) Initialize the udpipe backend
library(cleanNLP) # A Tidy Data Model for Natural Language Processing # A Tidy Data Model for Natural Language Processing # A Tidy Data Model for Natural Language Processing
# Initialize udpipe backend
cnlp_init_udpipe(model_name = "english")
# ---- 4) Initialize the stanza backend[TBL] Illustrative PDOs text in Projects’ documents
| Project_ID | Project_Name | Project_Development_Objective |
|---|---|---|
| P127665 | Second Economic Recovery Development Policy Loan | This development policy loan supports the Government of Croatia's reform efforts with the aim to: (i) enhance fiscal sustainability through expenditure-based consolidation; and (ii) strengthen investment climate. |
| P069934 | PERNAMBUCO INTEGRATED DEVELOPMENT: EDUCATION QUALITY IMPROVEMENT PROJECT | The development objectives of the Pernambuco Integrated Development: Education Quality Improvement Project are to (a) improve the quality, efficiency, and inclusiveness of the public education system; (b) modernize and strengthen the managerial, financial, and administrative capacity of the Secretariat of Education to set policies and guidelines for the sector and deliver public education efficiently; and (c) support the overall state modernization effort through interventions to be carried out in the Secretariat of Education and to be replicated in other state institutions. |
Notes on PDO text data quality
First, it is important to notice that all 7,548 projects approved before FY2001 had no PDO text available.
The exploratory analysis of the 11,353 projects WITH PDO text revealed some interesting findings:
- PDO text length: The PDO text is quite short, with a median of 2 sentences and a maximum of 9 sentences.
-
PDO text missingness: besides 11,306 projects with missing PDOs, 31 projects had some invalid PDO values, namely:
- 11 have PDO as one of: “.”,“-”,“NA”, “N/A”
- 7 have PDO as one of: “No change”, “No change to PDO following restructuring.”,“PDO remains the same.”
- 9 have PDO as one of: “TBD”, “TBD.”, “Objective to be Determined.”
- 4 have PDO as one of: “XXXXXX”, “XXXXX”, “XXXX”, “a”
Of the remaining 11,322 projects with a valid PDO, some more projects were excluded from the analysis for incompleteness:
- 3 projects without “project status”
- 2,176 projects without “board approval FY”
- 332 projects approved in FY >= FY2024 (for incomplete approval stage)
Lastly (and this was quite surprising to me) the remaining, viable 8,811 unique projects, were matched by only 7,582 unique PDOs! In fact, 2,235 projects share 1,006 NON-UNIQUE PDO text in the “cleaned” dataset. Why? Apparently, the same PDO is re-used for multiple projects (from 2 to as many as 9 times), likely in cases of follow-up phases of a parent project or components of the same lending program.”
In sum, the cleaning process yielded a usable set of 8,811 functional projects, which was split into a training subset (4,403) to explore and test models and a testing subset (4408), held out for post-prediction evaluation.
Evidently, in some cases,the same PDO is used for multiple projects (from a minimum of 2 to a maximum of 9 time!!!), most likely when there is a parent project or subsequent phases of the same lending program.
—————————————————————————
Load pre-processed Projs’ dataset + PDO dataset
Here I will just load the pre-processed data (training set only).
[Saved file projs_train_t & pdo_train_t]
Previous Tokenization and PoS Tagging
Typically, one of the first steps in this transformation from natural language to feature, or any of kind of text analysis, is tokenization.
i) Explain Tokenization
Breaking units of language into components relevant for the research question is called “tokenization”. Components can be words, n-grams, sentences, etc. or combining smaller units into larger units.
- Tokenization is a
row-wiseoperation: it changes the number of rows in the dataset.
The choices of tokenization
- Should words be lower cased?
- Should punctuation be removed?
- Should numbers be replaced by some placeholder?
- Should words be stemmed (also called lemmatization)? ☑️
- Should bigrams/multi-word phrase be used instead of single word phrases? ☑️
- Should stopwords (the most common words) be removed? ☑️
- Should rare words be removed? ❌
- Should hyphenated words be split into two words? ❌
for the moment I keep all as conservatively as possible
ii) Explain Pos Tagging
Linguistic annotation is a common for of enriching text data, i.e. adding information about the text that is not directly present in the text itself.
Upon this, e.g. classifying noun, verb, adjective, etc., one can discover intent or action in a sentence, or scanning “verb-noun” patterns.
Here I have a training dataset file with:
| Variable | Type | Provenance | Description | Example |
|---|---|---|---|---|
| proj_id | chr | original PDO data | ||
| pdo | chr | original PDO data | ||
| word | chr | original PDO data | Governments | |
| sid | int | output cleanNLP | sentence ID | |
| tid | chr | output cleanNLP | token ID within sentence | |
| token | chr | output cleanNLP | Tokenized form of the token. | government |
| token_with_ws | chr | output cleanNLP | Token with trailing whitespace | government |
| lemma | chr | output cleanNLP | The base form of the token | government |
| stem | chr | output SnowballC | The base form of the token | govern |
| upos | chr | output cleanNLP | Universal part-of-speech tag (e.g., NOUN, VERB, ADJ). | |
| xpos | chr | output cleanNLP | Language-specific part-of-speech tags. | |
| feats | chr | output cleanNLP | Morphological features of the token | |
| tid_source | chr | output cleanNLP | Token ID in the source document | |
| relation | chr | output cleanNLP | Dependency relation between the token and its head token | |
| pr_name | chr | output cleanNLP | Name of the parent token | |
| FY_appr | dbl | original PDO data | ||
| FY_clos | dbl | original PDO data | ||
| status | chr | original PDO data | ||
| regionname | chr | original PDO data | ||
| countryname | chr | original PDO data | ||
| sector1 | chr | original PDO data | ||
| theme1 | chr | original PDO data | ||
| lendinginstr | chr | original PDO data | ||
| env_cat | chr | original PDO data | ||
| ESrisk | chr | original PDO data | ||
| curr_total_commitment | dbl | original PDO data |
— PoS Tagging: upos (Universal Part-of-Speech)
| upos | n | percent | explan |
|---|---|---|---|
| ADJ | 21261 | 0.0852623 | Adjective |
| ADP | 27050 | 0.1084777 | Adposition |
| ADV | 2950 | 0.0118303 | Adverb |
| AUX | 3588 | 0.0143888 | Auxiliary |
| CCONJ | 14236 | 0.0570902 | Coordinating conjunction |
| DET | 21505 | 0.0862408 | Determiner |
| INTJ | 57 | 0.0002286 | Interjection |
| NOUN | 70752 | 0.2837344 | Noun |
| NUM | 2190 | 0.0087825 | Numeral |
| PART | 8691 | 0.0348532 | Particle |
| PRON | 2330 | 0.0093439 | Pronoun |
| PROPN | 14856 | 0.0595765 | Proper noun |
| PUNCT | 28393 | 0.1138635 | Punctuation |
| SCONJ | 2160 | 0.0086622 | Subordinating conjunction |
| SYM | 316 | 0.0012672 | Symbol |
| VERB | 25806 | 0.1034889 | Verb |
| X | 3219 | 0.0129090 | Other |
On random visual check, these are not always correct, but they are a good starting point for now.
iii) Custom Stopwords
Remove stop words, which are the most common words in a language.
- but I don’t want to remove any meaningful word for now
# Custom list of articles, prepositions, and pronouns
custom_stop_words <- c(
# Articles
"the", "a", "an",
"and", "but", "or", "yet", "so", "for", "nor", "as", "at", "by", "per",
# Prepositions
"of", "in", "on", "at", "by", "with", "about", "against", "between", "into", "through",
"during", "before", "after", "above", "below", "to", "from", "up", "down", "under",
"over", "again", "further", "then", "once",
# Pronouns
"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
"yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her",
"hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves" ,
"this", "that", "these", "those", "which", "who", "whom", "whose", "what", "where",
"when", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other",
# "some", "such", "no", "not",
# "too", "very",
# verbs
"is", "are", "would", "could", "will", "be", "e.g", "e.g.", "i.e.",
"i", "ii", "iii", "iv", "v",
# because tautology
"pdo"
)
# Convert to a data frame if needed for consistency with tidytext
custom_stop_words_df <- tibble(word = custom_stop_words)iv) Stemming
Often documents contain different versions of one base word, often called a stem. Stemming is the process of reducing words to their base or root form.
Snowball is one framework released in 1980 with an open-source license that can be found in R package SnowballC.
Why Stemming?: For example, in topic modeling, stemming reduces noise by making it easier for the model to identify core topics without being distracted by grammatical variations. (Lemmatization is more computationally intensive as it requires linguistic context and dictionaries, making it slower, especially on large datasets)
| Token | Lemma | Stem |
|---|---|---|
| development | development | develop |
| quality | quality | qualiti |
| high-quality | high-quality | high-qual |
| include | include | includ |
| logistics | logistic | logist |
| government/governance | Governemnt/government/governance | govern |
NOTE: Among
word/stemsencountered in PDOs, there are a lot of acronyms which may refer to World Bank lingo, or local agencies, etc… Especially when looked at in low case form they don’t make much sense…
Notes on sparsity
Sparsity in the context of a document-term matrix refers to the proportion of cells in the matrix that contain zeros. High sparsity means that most terms do not appear in most documents.
- removing stopwords before stemming can reduce sparsity
-
tidytext::cast_tdmturns a “tidy” one-term-per-document-per-row data frame into a Document-Term Matrix (DTM) from thetmpackage.- this dataset contains 4403 documents (each of them a PDO) and 11029 terms (distinct words). Notice that this DTM is 100% sparse (100% of document-word pairings are zero, bc most pairings of document and term do not occur (they have the value zero).
# create document-word matrix
DTM <- pdo_train_t %>%
anti_join(custom_stop_words_df, by = c("token_l" = "word")) %>%
count(proj_id, token_l) %>%
tidytext::cast_dtm(proj_id, token_l, n) # HIGH!!!
DTM
# <<DocumentTermMatrix (documents: 4403, terms: 11029)>>
# Non-/sparse entries: 129940/48430747
# Sparsity : 100%
# Maximal term length: 34
# Weighting : term frequency (tf)v) Document-term matrix or TF-IDF
The tf-idf is the product of the term frequency and the inverse document frequency::
\[ \begin{aligned} tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\ idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\ tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term}) \end{aligned} \]
— TF-IDF matrix on train pdo
# reduce size
pdo_train_4_tf_idf <- pdo_train_t %>% # 255964
# Keep only content words [very restrictive for now]
# normally c("NOUN", "VERB", "ADJ", "ADV")
filter(upos %in% c("NOUN")) %>% # 72,668
filter(!token_l %in% c("development", "objective", "project")) %>% # 66,741
# get rid of stop words (from default list)
filter(!token_l %in% custom_stop_words_df$word) %>% # 66,704
# Optional: Remove lemmas of length 1 or shorter
filter(nchar(lemma) > 1) # 66,350Now, count the occurrences of each lemma for each document. (This is the term frequency or tf)
With the lemma counts prepared, the bind_tf_idf() function from the tidytext package computes the TF-IDF scores.
What to use: token, lemma, or stem?
General Preference in Real-World NLP:
-
Tokensfor analyses where word forms matter or for sentiment analysis. -
Lemmas(*) for most general-purpose NLP tasks where you want to reduce dimensionality while maintaining accuracy and clarity of meaning. -
Stemsfor very large datasets, search engines, and applications where speed and simplicity are more important than linguistic precision.
(*) I use lemma, after “aggressively” reducing the number of words to consider, and removing stop words (at least for now).
_______
TEXT ANALYSIS/SUMMARY
_______
We are looking at (training data subset) pdo_train_t which has 249360 rows and 26 columns obtained from 4,403 PDOs (of which 4050 unique) of 4403 Wold Bank projects approved in Fiscal Years ranging from 2001 to 2023.
[TBL] Frequencies of documents/words/stems
| entity | counts |
|---|---|
| N proj | 4403 |
| N PDOs | 4050 |
| N words | 12953 |
| N token | 11114 |
| N lemma | 11178 |
| N stem | 8541 |
Term frequency
Note: normally, the most frequent words are function words (e.g. determiners, prepositions, pronouns, and auxiliary verbs), which are not very informative. Moreover, even content words (e.g. nouns, verbs, adjectives, and adverbs) can often be quite generic semantically speaking (e.g. “good” may be used for many different things).
However, in this analysis, I do not use the STOPWORD approach, but use the POS tags to reduce – in a more controlled way – the dataset, filtering the content words such as nouns, verbs, adjectives, and adverbs.
[FUNC] save plots
[FIG] Overall token freq ggplot
- Excluding “project” “develop”,“objective”
- Including only “content words” (NOUN, VERB, ADJ, ADV)
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent TOKEN in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
pdo_wrd_freq <- pdo_train_t %>% # 123,927
# include only content words
filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
#filter (!(upos %in% c("AUX","CCONJ", "INTJ", "DET", "PART","ADP", "SCONJ", "SYM", "PART", "PUNCT"))) %>%
filter (!(relation %in% c("nummod" ))) %>% # 173,686
filter (!(token_l %in% c("pdo","project", "development", "objective","objectives", "i", "ii", "iii",
"is"))) %>% # whne it is VERB
count(token_l) %>%
filter(n > 800) %>%
mutate(token_l = reorder(token_l, n)) # reorder values by frequency
# plot
pdo_wrd_freq_p <- pdo_wrd_freq %>%
ggplot(aes(token_l, n)) +
geom_col(fill = "#d7b77b") +
scale_y_continuous(breaks = seq(0, max(pdo_wrd_freq$n), by = 400)) + # directly use 'n' instead of .data$n
coord_flip() + # flip x and y coordinates so we can read the words better
labs(#title = title_text,
subtitle = "[TOKEN with count > 800]", y = "", x = "")+
geom_hline(yintercept = 800, linetype = "dashed", color = "#873c4a") +
my_pretty_theme[FIG] Overall stem freq ggplot
- Without “project” “develop”,“objective”
- Including only “content words” (NOUN, VERB, ADJ, ADV)
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent STEM in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
# Plot
pdo_stem_freq <- pdo_train_t %>% # 256,632
# include only content words
filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
filter (!(relation %in% c("nummod" ))) %>% # 173,686
filter (!(stem %in% c("pdo", "project", "develop", "object", "i", "ii", "iii"))) %>%
count(stem) %>%
filter(n > 800) %>%
mutate(stem = reorder(stem, n)) # reorder values by frequency
# plot
pdo_stem_freq_p <- pdo_stem_freq %>%
ggplot(aes(stem, n)) +
geom_col(fill = "#d7b77b") +
scale_y_continuous(breaks = seq(0, max(pdo_stem_freq$n), by = 400)) + # directly use 'n' instead of .data$n
coord_flip() + # flip x and y coordinates so we can read the words better
labs(#title = title_text,
subtitle = "[STEM with count > 800]", y = "", x = "") +
geom_hline(yintercept = 800, linetype = "dashed", color = "#873c4a") +
my_pretty_themeEvidently, after stemming, more words (or stems) reach the threshold frequency count of 800 (they have been combined by root).
[FIG] token + stem freq ggplot
title2_text <- glue::glue("Most frequent TOKEN & STEM in {n_distinct(pdo_train_t$proj_id)} PDOs")
subtitle2_text <- glue::glue("From projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
combo_freq <- pdo_wrd_freq_p + pdo_stem_freq_p +
plot_annotation(title = title2_text,
subtitle = subtitle2_text,
# caption = "Source: World Bank Project Documents",
theme = theme(plot.title = element_text(size = 12, face = "bold"),
plot.subtitle = element_text(size = 10, face = "italic"),
plot.caption = element_text(size = 10, face = "italic"))
)
combo_freqf_save_plot("combo_freq", combo_freq)_______
SECTOR in PDO v. WDR publications
For the (broadly defined) HEALTH sector, it is quite clear that Covid-19 is the main driver of the peak in 2020.
What about the other sectors? I was struck by the fact that, observing PDOs over time, the broadly defined “sector term” in the PDO always presents at least one peak and I wonder what could trigger it.
One possible explanation is that the PDOs somehow reflect the topics discussed by the World Development Reports (WDR) published annually by the World Bank. The WDR is a flagship publication of the World Bank that provides in-depth analysis of a specific aspect of development.
It is important to remark that these publications are not some speculative research endeavor, as they are deeply rooted in the concrete information that the Bank retrieves on the ground from projects and operations as they are supported and evaluated. In turn, the WDRs themselves inform the Bank’s policy priorities and operational strategies.
Therefore, it is reasonable to expect some kind of correlation between the topics discussed in the WDRs and the objectives of projects stated in in the PDOs.
Ingest WDR data
Previously created, as explained in data/derived_data/_provenance.md
— Manually add WDR 2023
library(tibble)
# Create a named list of NA values for subj_11 to subj_46
na_values <- setNames(rep(NA, 35), paste0("subj_", 12:46))
# Add a new row with the existing columns and NA for subj_11 to subj_46
wdr <- wdr %>%
add_row(
date_issued = 2023,
decade = "2020s",
id = NA, # ?
ISBN = "978-1-4648-1941-4",
title = "Migrants, Refugees, and Societies",
doc_mt_identifier_1 = "oai:openknowledge.worldbank.org:10986/39696", #?
subject_miss = NA,
abstract = "Migration is a development challenge. About 184 million people-2.3 percent of the world’s population-live outside of their country of nationality. Almost half of them are in low- and middle-income countries. But what lies ahead? As the world struggles to cope with global economic imbalances, diverging demographic trends, and climate change, migration will become a necessity in the decades to come for countries at all levels of income. If managed well, migration can be a force for prosperity and can help achieve the United Nations’ Sustainable Development Goals. World Development Report 2023 proposes an innovative approach to maximize the development impacts of cross-border movements on both destination and origin countries and on migrants and refugees themselves. The framework it offers, drawn from labor economics and international law, rests on a “Match and Motive Matrix” that focuses on two factors: how closely migrants’ skills and attributes match the needs of destination countries and what motives underlie their movements. This approach enables policy makers to distinguish between different types of movements and to design migration policies for each. International cooperation will be critical to the effective management of migration.",
url_keys = "https://openknowledge.worldbank.org/handle/10986/39696",
altmetric = 150,
all_topic = "Poverty Reduction,Social Development,Conflict and Development",
all_subj = "migration,migrants,refugees,force displacement,crss-border mobility,remittances,origin country,international protection,refugee-hosting country,irregular migration,international cooperation",
subj_1 = "migration",
subj_2 = "migrants",
subj_3 = "refugees",
subj_4 = "force displacement",
subj_5 = "crss-border mobility",
subj_6 = "remittances",
subj_7 = "origin country",
subj_8 = "international protection",
subj_9 = "refugee-hosting country",
subj_10 = "irregular migration",
subj_11 = "international cooperation",
!!!na_values # Unpack the NA values for subj_12 to subj_46
)— Manually add WDR 2024
library(tibble)
# Create a named list of NA values for subj_11 to subj_46
na_values <- setNames(rep(NA, 35), paste0("subj_", 12:46))
# https://documents.worldbank.org/en/publication/documents-reports/documentdetail/099042523192514880/p17826903573340450b2d00e8cfd3baf7ac
# https://openknowledge.worldbank.org/entities/publication/5e5ac9f1-71ee-4734-825e-60966658395f/full
# Add a new row with the existing columns and NA for subj_11 to subj_46
wdr <- wdr %>%
add_row(
date_issued = 2024,
decade = "2020s",
id = NA, # ?
ISBN = "978-1-4648-2078-6",
title = "The Middle-Income Trap",
doc_mt_identifier_1 = "oai:openknowledge.worldbank.org:10986/41919", #?
subject_miss = NA,
abstract = "Middle-income countries are in a race against time. Many of them have done well since the 1990s to escape low-income levels and eradicate extreme poverty, leading to the perception that the last three decades have been great for development. But the ambition of the more than 100 economies with incomes per capita between US$1,100 and US$14,000 is to reach high-income status within the next generation. When assessed against this goal, their record is discouraging. Since the 1970s, income per capita in the median middle-income country has stagnated at less than a tenth of the US level. With aging populations, growing protectionism, and escalating pressures to speed up the energy transition, today’s middle-income economies face ever more daunting odds. To become advanced economies despite the growing headwinds, they will have to make miracles. Drawing on the development experience and advances in economic analysis since the 1950s, World Development Report 2024 identifies pathways for developing economies to avoid the “middle-income trap.” It points to the need for not one but two transitions for those at the middle-income level: the first from investment to infusion and the second from infusion to innovation. Governments in lower-middle-income countries must drop the habit of repeating the same investment-driven strategies and work instead to infuse modern technologies and successful business processes from around the world into their economies. This requires reshaping large swaths of those economies into globally competitive suppliers of goods and services. Upper-middle-income countries that have mastered infusion can accelerate the shift to innovation—not just borrowing ideas from the global frontiers of technology but also beginning to push the frontiers outward. This requires restructuring enterprise, work, and energy use once again, with an even greater emphasis on economic freedom, social mobility, and political contestability. Neither transition is automatic. The handful of economies that made speedy transitions from middle- to high-income status have encouraged enterprise by disciplining powerful incumbents, developed talent by rewarding merit, and capitalized on crises to alter policies and institutions that no longer suit the purposes they were once designed to serve. Today’s middle-income countries will have to do the same.",
url_keys = "https://openknowledge.worldbank.org/handle/10986/41919",
altmetric = 13,
all_topic = "Macroeconomics,Economic Growth,Business Cycles and Stabilization Policies,Poverty Reduction,Achieving Shared Growth,Science and Technology Development,Innovation",
all_subj = "middle-income trap,investment,infusion,innovation,technologies,competitive suppliers,economic freedom",
subj_1 = "middle-income trap",
subj_2 = "investment",
subj_3 = "infusion",
subj_4 = "innovation",
subj_5 = "technologies",
subj_6 = "competitive suppliers",
subj_7 = "economic freedom",
subj_8 = NA,
subj_9 = NA,
subj_10 = NA,
subj_11 = NA,
!!!na_values # Unpack the NA values for subj_12 to subj_46
)— Manually correct WDR 2011
wdr$url_keys [wdr$id == "4389"] <- "https://openknowledge.worldbank.org/handle/10986/4389"
wdr$altmetric [wdr$id == "4389"] <- "210"
wdr$abstract [wdr$id == "4389"] <- "The 2011 World development report looks across disciplines and experiences drawn from around the world to offer some ideas and practical recommendations on how to move beyond conflict and fragility and secure development. The key messages are important for all countries-low, middle, and high income-as well as for regional and global institutions: first, institutional legitimacy is the key to stability. When state institutions do not adequately protect citizens, guard against corruption, or provide access to justice; when markets do not provide job opportunities; or when communities have lost social cohesion-the likelihood of violent conflict increases. Second, investing in citizen security, justice, and jobs is essential to reducing violence. But there are major structural gaps in our collective capabilities to support these areas. Third, confronting this challenge effectively means that institutions need to change. International agencies and partners from other countries must adapt procedures so they can respond with agility and speed, a longer-term perspective, and greater staying power. Fourth, need to adopt a layered approach. Some problems can be addressed at the country level, but others need to be addressed at a regional level, such as developing markets that integrate insecure areas and pooling resources for building capacity Fifth, in adopting these approaches, need to be aware that the global landscape is changing. Regional institutions and middle income countries are playing a larger role. This means should pay more attention to south-south and south-north exchanges, and to the recent transition experiences of middle income countries."
wdr$all_topic [wdr$id == "4389"] <- tolower("Justice,Jobs,Political Violence and Civil War,Political Violence and War,Organized Crime,Fragility,Conflict and Violence,Crime,Social Cohesion,Public Sector Management,Social Development,Law and Development, Social Protections and Labor,Conflict and Development,Water Supply and Sanitation,Judicial System Reform, Labor Markets,Armed Conflict,Urban Solid Waste Management")
# Define the subjects to be added for the specific row
subjects <- c(
"Armed Conflict",
"Civil Wars",
"Conflict Prevention",
"Conflict Resolution",
"Development Policy",
"Fragile States",
"International Development",
"Peacebuilding",
"Political Instability",
"Post-Conflict Reconstruction",
"Security and Development"
) %>% tolower() # Convert subjects to lowercase
# Ensure id is handled as character and enforce lowercase comparison
wdr <- wdr %>%
mutate(across(starts_with("subj_"),
~ ifelse(id == "4389",
subjects[as.numeric(sub("^subj_", "", cur_column()))],
NA_character_))) %>%
mutate (all_subj = if_else(id == "4389", paste0(subjects, collapse = ","), all_subj))
# Check the result for the row with id == "4389"
wdr %>% filter(id == "4389") %>% select(starts_with("subj_")) # Display the updated subject columns
# check <- wdr[wdr$id == "4389",] — Remove extra space in title column
— re-save upon correction wrd2.rds
[TBL] World Develompent Reports 2000-2024
Below are the titles of the World Development Reports from 2000 to 2024.
Qualify: peak or trend (by sector)
Add geomvline to sector plots v WDR title [CMPL 🟠]
tabyl(pdo_train_t$tok_sector_broad)
# pdo_train_t$tok_sector_broad n WDR
# AGR_FOR_FISH 665 WDR 2008 Agriculture for Development
# EDUCATION 1180 WDR 2004 Making Services Work for Poor People
# ENERGY 886 WDR
# FINANCIAL 1843 WDR
# GENDER_EQUAL 213 WDR 2012 Gender Equality and Development
# HEALTH 946 WDR
# ICT 548 WDR
# IND TRADE SERV 60 WDR
# INSTITUTIONAL SUPP. 2171 WDR
# MINING OIL&GAS 299 WDR
# TRANSPORT 1371 WDR
# URBAN 553 WDR
# WAT_SAN 1069 WDR — ✅ AGR_FOR_FISH ( Agriculture, forestry, and fishing)
The WDR of 2008 was titled ” Agriculture for Development”, link
# --- Get a LIST of unique sectors (facets) and split the data
sector_list <- base::split(x = sector_broad, f = sector_broad$tok_sector_broad)
# Specific split df
#sector_list$'AGR_FOR_FISH'
# Specific plot
pdo_agr_plot <- sector_plots$'AGR_FOR_FISH' +
geom_vline(xintercept = 2012, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2012, y = max(n) * 0.5, label = "WDR Agric"),
angle = 90, vjust = -0.5, color = "#9b6723")
pdo_agr_plotf_save_plot("pdo_agr_plot", pdo_agr_plot)— ✅ EDUCATION
WDR 2007 was titled ” Development and the Next Generation” WDR 2018 was titled ” Learning to Realize Education’s Promise”
# Specific split df
#sector_list$EDUCATION
# Specific plot
pdo_edu_plot <- sector_plots$EDUCATION +
geom_vline(xintercept = 2007, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2007, y = max(n) * 0.45, label = "WDR Youth"),
angle = 90, vjust = -0.5, color = "#9b6723") +
geom_vline(xintercept = 2018, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2018, y = max(n) * 0.30, label = "WDR Educ"),
angle = 90, vjust = -0.5, color = "#9b6723")
pdo_edu_plotf_save_plot("pdo_edu_plot", pdo_edu_plot)— ✅ CLIMATE (climate change)
The WDR of 2010 was titled ” Development and Climate Change”, link
# --- Get a LIST of unique sectors (facets) and split the data
sector_list <- base::split(x = sector_broad, f = sector_broad$tok_sector_broad)
# GENDER split df
#sector_list$'CLIMATE'
# Specific plot
pdo_clim_plot <- sector_plots$'CLIMATE' +
# geom_vline(xintercept = 2003, linetype = "solid", color = "#9b6723",alpha = 0.35) +
# geom_text(aes(x = 2003, y = max(n) * 0.5, label = "WDR Sust Dev"),
# angle = 90, vjust = -0.5, color = "#9b6723")+
geom_vline(xintercept = 2010, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2010, y = max(n) * 0.2, label = "WDR Climate change"),
angle = 90, vjust = -0.5, color = "#9b6723")
pdo_clim_plotf_save_plot("pdo_clim_plot", pdo_clim_plot)— ✅ GENDER EQUALITY
the WDR of 2012 was titled “Gender Equality and Development”, link
# --- Get a LIST of unique sectors (facets) and split the data
sector_list <- base::split(x = sector_broad, f = sector_broad$tok_sector_broad)
# GENDER split df
#sector_list$GENDER_EQUAL
# Specific plot
pdo_gen_plot <- sector_plots$GENDER_EQUAL +
geom_vline(xintercept = 2012, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2012, y = max(n) * 0.75, label = "WDR Gender equal"),
angle = 90, vjust = -0.5, color = "#9b6723")
pdo_gen_plotf_save_plot("pdo_gen_plot", pdo_gen_plot)— INSTITUTIONAL SUPPORT
WDR 2002 ” Building Institutions for Markets” WDR 2007 ” Governance and the Law”
# Specific split df
# sector_list$INSTIT_SUP
# Specific plot
pdo_inst_plot <- sector_plots$INSTIT_SUPP +
geom_vline(xintercept = 2002, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2002, y = max(n) * 0.75, label = "WDR Institutions"),
angle = 90, vjust = -0.5, color = "#9b6723") +
geom_vline(xintercept = 2007, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2007, y = max(n) * 0.75, label = "WDR Governance"),
angle = 90, vjust = -0.5, color = "#9b6723")
pdo_inst_plotf_save_plot("pdo_inst_plot", pdo_inst_plot)— ICT
WDR 2016 ” Digital Dividends” WDR 2021 ” Data for Better Lives”
# Specific split df
# sector_list$ICT
# Specific plot
pdo_ict_plot <- sector_plots$ICT +
geom_vline(xintercept = 2016, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2016, y = max(n) * 0.75, label = "WDR Digital Div"),
angle = 90, vjust = -0.5, color = "#9b6723") +
geom_vline(xintercept = 2021, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2021, y = max(n) * 0.75, label = "WDR Data"),
angle = 90, vjust = -0.5, color = "#9b6723")
pdo_ict_plotf_save_plot("pdo_ict_plot", pdo_ict_plot)— FINANCIAL
WDR 2005 ” A Better Investment Climate for Everyone” WDR 2022 ” Finance for an Equitable Recovery”
# Specific split df
#sector_list$FINANCIAL
# Specific plot
pdo_fin_plot <- sector_plots$FIN +
geom_vline(xintercept = 2005, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2005, y = max(n) * 0.80, label = "WDR Inv Clim"),
angle = 90, vjust = -0.5, color = "#9b6723") +
geom_vline(xintercept = 2022, linetype = "solid", color = "#9b6723",alpha = 0.35) +
geom_text(aes(x = 2022, y = max(n) * 0.75, label = "WDR Finance"),
angle = 90, vjust = -0.5, color = "#9b6723")
pdo_fin_plotf_save_plot("pdo_fin_plot", pdo_fin_plot)_______
BIGRAMS
_______
Here I use [
clnp_annotate()output + ]dplyrto combine consecutive tokens into bigrams.
# Create bigrams by pairing consecutive tokens by sentence ID and token IDs
bigrams <- pdo_train_t %>%
# keeping FY with tokens
group_by(FY_appr, proj_id, pdo, sid ) %>%
arrange(tid) %>%
# Using mutate() and lead(), we create bigrams from consecutive tokens
mutate(next_token = lead(token),
bigram = paste(token, next_token)) %>%
# make bigram low case
mutate(bigram = tolower(bigram)) %>%
# only includes the rows where valid bigrams are formed
filter(!is.na(next_token)) %>%
ungroup() %>%
arrange(FY_appr, proj_id, sid, tid) %>%
select(FY_appr,proj_id, pdo,sid, tid, token, bigram) Clean bigrams
The challenge is to clean but without separating consecutive words… so I do this split-reunite process to remove stopwords and punctuation. Basically only keeping bigrams made of 2 nouns or ADJ+noun.
# Separate the bigram column into two words
bigrams_cleaned <- bigrams %>%
tidyr::separate(bigram, into = c("word1", "word2"), sep = " ")
# Remove stopwords and bigrams in EACH component word containing punctuation
bigrams_cleaned <- bigrams_cleaned %>%
# custom stop words
filter(!word1 %in% custom_stop_words_df$word, !word2 %in% custom_stop_words_df$word) %>%
# Remove punctuation
filter(!str_detect(word1, "[[:punct:]]"), !str_detect(word2, "[[:punct:]]"))
# Reunite the component cleaned words into the bigram column
bigrams_cleaned <- bigrams_cleaned %>%
unite(bigram, word1, word2, sep = " ") %>%
# Remove too obvious bigrams
filter(!bigram %in% c("development objective", "development objectives",
"proposed project", "project development", "program development"))
# View the cleaned dataframe
bigrams_cleaned
# Count the frequency of each bigram
bigram_freq <- bigrams_cleaned %>%
count(bigram, sort = TRUE)[FIG] most frequent bigrams in PDOs
- Excluding bigrams where 1 word is among stopwords or a punctuation sign
- Excluding “development objective/s”, “proposed project”, “program development” because not very informative
# ---- Prepare data for plotting
# Evaluate the title with glue first
title_text <- glue::glue("Frequency of bigrams in PDOs over FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
# Define the bigrams you want to highlight
bigrams_to_highlight <- c("public sector", "private sector", "eligible crisis",
"health care", "health services", "public health")
# ---- Plot the most frequent bigrams
pdo_bigr_freq <- bigram_freq %>%
slice_max(n, n = 25) %>%
ggplot(aes(x = reorder(bigram, n), y = n,
fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
geom_col() +
# coord flipped so n is Y axis
scale_y_continuous(breaks = seq(min(bigram_freq$n)-1, max(bigram_freq$n), by = 50)) +
scale_fill_manual(values = c("public sector" = "#005ca1",
"private sector" = "#9b2339",
"eligible crisis"= "#8e550a",
"health care"= "#245048",
"health services"= "#245048",
"public health"= "#245048",
"Other" = "grey")) +
guides(fill = "none") +
coord_flip() +
labs(title = title_text, subtitle = "(ranking top 25 bigrams)",
x = "", y = "") +
theme(axis.text.y = element_text(
# obtain vector of colors 2 match x axis labels color to fill
color = bigram_freq %>%
slice_max(n, n = 25) %>%
# mutate(color = ifelse(bigram %in% bigrams_to_highlight,
# ifelse(bigram == "public sector", "#005ca1",
# ifelse(bigram == "private sector", "#9b2339", "#8e550a")),
# "#4c4c4c"))
mutate(color = case_when (
bigram == "public sector" ~ "#005ca1",
bigram == "private sector" ~ "#9b2339",
bigram == "eligible crisis" ~ "#8e550a",
bigram %in% c("health care", "health services", "public health") ~ "#245048",
TRUE ~ "#4c4c4c")) %>%
# Ensure the order matches the reordered bigrams (AS BINS)
arrange(reorder(bigram, n)) %>%
# Extract the color column in bin order as vector to be passed to element_text()
pull(color)
)
) + my_pretty_theme
pdo_bigr_freqResults are not surprising in terms of frequent bigram recurrence:
- See for example “increase access”, “service delivery” ,“institutional capacity”, “poverty reduction” etc, at the top.
- Although, while “health” recurred in several bigrams (e.g. “health services”, “public health”, “health care”) among the top 25, “education” did not appear at all.
- A bit mysterious is perhaps “eligible crisis” (> 100 mentions)?! (coming back to this later)
[FIG] Changes over time BY 1FY
Besides huge, counter intuitive, difference between “health” and “education”, “climate change” appears in the top 25 (ranking above “financial sector” and “capacity building”) which begs the question: Has the frequency of these bigrams has changed over time?
## too busy to be useful
# Step 1: Count the frequency of each bigram by year
top_bigrams_1FY <- bigrams_cleaned %>%
group_by(FY_appr, bigram) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(FY_appr, desc(count)) %>%
# --- +/- top 10
group_by(FY_appr) %>%
top_n(10, count) %>%
ungroup()
# # --- STRICT top 10
# mutate(rank = dense_rank(desc(count))) %>% # Rank bigrams by frequency
# filter(rank <= 10) %>% # Keep only the top 10 by rank
# ungroup()
# Add specific bigrams to highlight, if any
bigrams_to_highlight <- c("climate change", "climate resilience", "public sector", "private sector")
# Step 2: Plot the top bigrams by frequency over time
pdo_bigr_FY_freq <- top_bigrams_1FY %>%
ggplot(aes(x = reorder(bigram, count),
y = count,
fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
geom_col() +
scale_fill_manual(values = c("public sector" = "#005ca1", "private sector" = "#e60066",
"climate change" = "#399B23", "climate resilience" = "#d8e600",
"Other" = "grey")) +
guides(fill = "none") +
coord_flip() +
facet_wrap(~ FY_appr, scales = "free_y") +
labs(title = "Top 10 Bigrams by Frequency Over Time",
subtitle = "(Faceted by Fiscal Year Approval)",
x = "Bigrams",
y = "Count") +
theme_minimal() +
theme(plot.title.position = "plot",
axis.text.x = element_text(angle = 45, hjust = 1))+
my_pretty_theme
pdo_bigr_FY_freq[FIG] Changes over time BY 3FY
To reduce the noise and make the plot more readable, we can group the data by 3 fiscal years (FY) intervals.
# generate FY group
f_generate_year_groups <- function(years, interval) {
breaks <- seq(floor(min(years, na.rm = TRUE) / interval) * interval,
ceiling(max(years, na.rm = TRUE) / interval) * interval,
by = interval)
labels <- paste(breaks[-length(breaks)], "-", breaks[-1] - 1)
return(list(breaks = breaks, labels = labels))
}# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 3 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show
# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_FYper <- bigrams_cleaned %>%
# cut divides the range of x into intervals
mutate(FY_group = base::cut(FY_appr,
breaks = year_groups$breaks,
include.lowest = TRUE,
right = FALSE,
labels = year_groups$labels)) %>%
# Count the frequency of each bigram by n-year groups
group_by(FY_group, bigram) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(FY_group, desc(count)) %>%
# Top ? bigrams for each n-year period
group_by(FY_group) %>%
top_n(top_n_i, count) %>%
ungroup()
# --- Step 3: Add specific bigrams to highlight, if any
bigrams_to_highlight <- c("climate change", "climate resilience",
"eligible crisis",
"public sector", "private sector",
"water supply", "sanitation services",
"health care", "health services", "public health", "health preparedness"
)
# --- Step 4: Plot the top bigrams by frequency over n-year periods
pdo_bigr_FY_freq <- top_bigrams_FYper %>%
ggplot(aes(x = reorder(bigram, count),
y = count,
fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
geom_col() +
scale_fill_manual(values = c(
# "public sector" = "#005ca1",
# "private sector" = "#e60066",
"water supply" = "#26BDE2",
"sanitation services" = "#26BDE2",
"climate change" = "#3F7E44",
"climate resilience" = "#a6bd23",
"eligible crisis" = "#e68000",
"health care" = "#E5243B",
"health services" = "#E5243B",
"public health" = "#E5243B",
"Other" = "grey")) +
guides(fill = "none") +
coord_flip() +
facet_wrap(~ FY_group, ncol = 2 , scales = "free_y" )+
#strip.position = "top") + # Facet wrap with columns
labs(title = glue::glue("Top 12 Bigrams by Frequency Over {interval_i}-Year Periods"),
subtitle = "(Some sectors highlighted)",
x = "",
y = "") +
my_pretty_theme# print the plot
pdo_bigr_FY_freqFrequency observed over FY intervals is very revealing.
- Interesting to see the trend of “water supply” and “sanitation services” bigrams, which are quite stable over time.
- The bigram “health care” and “health services” are also quite stable, while “public health” obviously gained relevance since the 2019-2021 FY period.
- Conversely, “private sector” and “public sector” loose importance over time (around mid 2010s), while “climate change” and “climate resilience” gain relevance from the same point on.
- Still quite surprising the bigram “eligible crisis”, which actually appears in the top 12 bigrams starting in FY 2016-2018!
🤔 Which are the most frequent and persistent Bigrams Over Time?
For this, I am looking for a ranking that considers Mean frequency across periods
arrange(desc(mean_count))+ Stability (low standard deviation) across periods [this is hard bc of NAs], and NOT total count overall…
- Using
top_bigrams_FYperwhich had breaks of 3FY
# ------------------------------[REPEATED just to see the table]
# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 3 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show
# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_FYper <- bigrams_cleaned %>%
# cut divides the range of x into intervals
mutate(FY_group = base::cut(FY_appr,
breaks = year_groups$breaks,
include.lowest = TRUE,
right = FALSE,
labels = year_groups$labels)) %>%
# Count the frequency of each bigram by n-year groups
group_by(FY_group, bigram) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(FY_group, desc(count)) %>%
# Top ? bigrams for each n-year period
group_by(FY_group) %>%
top_n(top_n_i, count) %>%
ungroup()sd() returns NA for bigrams that are not present in any periods (or are present in just 1 period).
# Calculate the mean frequency and standard deviation of the counts for each bigram across periods
stable_and_frequent_bigrams_per <- top_bigrams_FYper %>%
group_by(bigram) %>%
summarise(mean_count = mean(count, na.rm = TRUE), # Mean frequency across periods
sd_count = sd(count, na.rm = TRUE), # Stability (lower sd = more stable)
count_non_na = sum(!is.na(count)), # Count non-NA values
sd_count2 = if_else(count_non_na >= 1, sd(count, na.rm = TRUE), NA_real_), # Only calculate sd if >= 3 non-NA
total_count = sum(count)) %>% # Total count across all periods (optional)
arrange(desc(mean_count)) %>% # Sort by frequency and then stability
# Filter out bigrams with low mean frequency or high instability (you can adjust thresholds)
# Focus on the top 25% most frequent bigrams
filter(mean_count > quantile(mean_count, 0.70, na.rm = TRUE)) #%>%
# Focus on the most stable 50% (lower sd) ---> NO bc NA values
#filter( sd_count < quantile(sd_count, 0.5, na.rm = TRUE))[TBL] Bigrams Over Time [3FY]
# View the most frequent and stable bigrams
stable_and_frequent_bigrams_per %>%
slice_head(n = 15) %>% kableExtra::kable()| bigram | mean_count | sd_count | count_non_na | sd_count2 | total_count |
|---|---|---|---|---|---|
| increase access | 39.83333 | 6.080022 | 6 | 6.080022 | 239 |
| eligible crisis | 37.33333 | 1.527525 | 3 | 1.527525 | 112 |
| threat posed | 33.00000 | NA | 1 | NA | 33 |
| private sector | 31.20000 | 10.917875 | 5 | 10.917875 | 156 |
| health preparedness | 31.00000 | NA | 1 | NA | 31 |
| strengthen national | 28.00000 | NA | 1 | NA | 28 |
| service delivery | 27.71429 | 5.313953 | 7 | 5.313953 | 194 |
| climate change | 27.00000 | 2.828427 | 2 | 2.828427 | 54 |
| poverty reduction | 27.00000 | 14.514361 | 4 | 14.514361 | 108 |
| public health | 25.50000 | 16.263456 | 2 | 16.263456 | 51 |
| public sector | 25.25000 | 8.301606 | 4 | 8.301606 | 101 |
| institutional capacity | 24.87500 | 6.577831 | 8 | 6.577831 | 199 |
| improve access | 24.57143 | 8.521681 | 7 | 8.521681 | 172 |
| national systems | 24.00000 | NA | 1 | NA | 24 |
- Using
top_bigrams_1FYwhich had breaks of 1FY
# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 1 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show
# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_1FY <- bigrams_cleaned %>%
# cut divides the range of x into intervals
mutate(FY_group = base::cut(FY_appr,
breaks = year_groups$breaks,
include.lowest = TRUE,
right = FALSE,
labels = year_groups$labels)) %>%
# Count the frequency of each bigram by n-year groups
group_by(FY_group, bigram) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(FY_group, desc(count)) %>%
# Top ? bigrams for each n-year period
group_by(FY_group) %>%
top_n(top_n_i, count) %>%
ungroup()# Calculate the mean frequency and standard deviation of the counts for each bigram across periods
stable_and_frequent_bigrams_1FY <- top_bigrams_1FY %>%
group_by( bigram) %>%
summarise(mean_count = mean(count, na.rm = TRUE), # Mean frequency across periods
sd_count = sd(count, na.rm = TRUE), # Stability (lower sd = more stable)
total_count = sum(count)) %>% # Total count across all periods (optional)
arrange(desc(mean_count)) %>% # Sort by frequency and then stability
# Filter out bigrams with low mean frequency or high instability (you can adjust thresholds)
# Focus on the top 25% most frequent bigrams
filter(mean_count > quantile(mean_count, 0.70, na.rm = TRUE)) #%>%
# Focus on the most stable 50% (lower sd) ---> NO bc NA values
#filter( sd_count < quantile(sd_count, 0.5, na.rm = TRUE))[TBL] Bigrams Over Time [1FY]
# View the most frequent and stable bigrams
stable_and_frequent_bigrams_1FY %>%
slice_head(n = 15) %>% kableExtra::kable()| bigram | mean_count | sd_count | total_count |
|---|---|---|---|
| mobile applications | 21.00000 | NA | 21 |
| public health | 16.66667 | 3.0550505 | 50 |
| threat posed | 16.50000 | 2.1213203 | 33 |
| health preparedness | 15.50000 | 0.7071068 | 31 |
| increase access | 14.64706 | 5.1713293 | 249 |
| eligible crisis | 14.62500 | 10.1971635 | 117 |
| strengthen national | 14.00000 | 2.8284271 | 28 |
| vulnerable households | 13.00000 | NA | 13 |
| respond promptly | 12.50000 | 10.6066017 | 25 |
| action plan | 12.00000 | NA | 12 |
| disaster risk | 12.00000 | NA | 12 |
| local governments | 12.00000 | NA | 12 |
| national systems | 12.00000 | 1.4142136 | 24 |
| world bank | 12.00000 | NA | 12 |
| climate resilience | 11.66667 | 4.5092498 | 35 |
_______
Explore specific bigrams
— Public/Private ~ compare frequency over FY
A case in which looking at bigrams may be better than tokens is the question whether WB project are more focused on public or private sector. It is not easy to capture this information from the text, because:
- “government” may be referred to the subject/counterpart of the project (e.g. “government of Mozambique”)
- “private” is not necessarily referred to the “private sector” (e.g. “private households”)
- “public” is not necessarily referred to the “public sector” (e.g. “public health”)
So, I narrow down to consecutive bigrams “public sector” and “private sector” to get an indicative frequency of these terms.
[FIG] Bigrams (“public sector”, “private sector”) freq plot
# Filter for the specific bigrams "public sector" and "private sector"
bigrams_pub_priv_sec <- bigrams %>%
filter(bigram %in% c("public sector", "private sector"))
# Display the result
#bigrams_pub_priv_sec
# prepare data for plotting (count)
sector_bigr_df <- bigrams_pub_priv_sec %>%
count(FY_appr, bigram) %>%
# reorder values by frequency
mutate(bigram = factor(bigram, levels = c("public sector", "private sector")))# ---- Prepare data for plotting
# Evaluate the title with glue first
title_text <- glue::glue("Frequency of bigrams \"public sector\" and \"private sector\" in PDOs over FY {min(sector_bigr_df$FY_appr)}-{max(sector_bigr_df$FY_appr)}")
two_col_contrast <- c( "#005ca1", "#e60066" )
# Create a named vector for the legend labels with totals in a single pipeline
legend_labels <- sector_bigr_df %>%
group_by(bigram) %>%
# Calculate total counts for each bigram
summarize(total_n = sum(n)) %>%
# Append totals to bigram names
mutate(label = paste0(bigram, " (", total_n, ")")) %>%
# Create a named vector with bigram as names and labels as values
{setNames(.$label, .$bigram)} # curly braces {} in a dplyr pipeline using . as ouptu from previous..
# ---- Plot
pdo_pub_pri_bigr <- ggplot(data = sector_bigr_df, aes(x = FY_appr, y = n, group = bigram, color = bigram)) +
geom_line(linetype = "dotted", alpha = 0.75, size = 1) +
geom_point(size = 3) +
scale_x_continuous(breaks = seq(2001, 2023, by = 1)) +
scale_color_manual(values = two_col_contrast,
labels = legend_labels) + # Use modified labels
my_pretty_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = title_text,
x = "",
y = "",
color = "")
pdo_pub_pri_bigr# Save the plot
f_save_plot("pdo_pub_pri_bigr", pdo_pub_pri_bigr)Note:
- these are much less common than the single words.
- What happens in FY 2014-2016 that makes these bigram drop in frequency of mention?
— climate change ~ notable bigrams over FY [CMPL 🟠]
— eligible crisis ~ notable bigrams over FY
First of all, let’s see what are the sentence that contain the bigram “eligible crisis” in the PDOs.
# ---- Define the bigram you want to find
target_bigram <- "eligible crisis"
# Tokenize the text data into sentences
sentences <- pdo_t %>%
unnest_tokens(sentence, pdo, token = "sentences")
# Count the number of sentences in each document
sentence_count <- sentences %>%
group_by(proj_id) %>%
summarise(num_sentences = n())
n_distinct(sentence_count$proj_id) # number of projects
sum(sentence_count$num_sentences) # total number of sentences
# Filter sentences that contain the specific bigram
sentences_with_targ <- sentences %>%
filter(str_detect(sentence, target_bigram))
# Define how many characters before and after the bigram to extract
chars_before <- 60 # Number of characters before the bigram
chars_after <- 60 # Number of characters after the bigram
# Add the extracted bigram and surrounding characters to the same dataframe
sentences_with_eligcris <- sentences_with_targ %>%
mutate(closest_text = str_extract(sentence, paste0(".{0,", chars_before, "}", target_bigram, ".{0,", chars_after, "}"))) %>%
# View the updated dataframe with the closest_text column
select(proj_id, #sentence,
closest_text)[TBL] Close phrase around bigram “eligible crisis”
I still don’t know what “eligible” crisis means, but it appears that the following is like a commonly used phrase in the PDOs: “to respond promptly and effectively” as well as “provide immediate and effective response to” seem to often accompany the text “eligible crisis or emergency”. Presumably, a standard sentence indicating eligibility for ODA funding.
# Define the phrase you want to search for in the vicinity of the target bigram
phrase_to_search <- "respond promptly and effectively"
# Count how often the phrase appears in the vicinity of the target bigram
phrase_count <- sentences_with_eligcris %>%
mutate(contains_phrase = str_detect(closest_text, phrase_to_search)) %>% # Check if the phrase is present
summarise(count = sum(contains_phrase)) # Count how many times the phrase is found
# View the result
tabyl(phrase_count$count)Here are a few examples of the sentences containing the bigram “eligible crisis” and the phrase “respond promptly and effectively”:
# Filter the sentences that contain the phrase
sample_with_eligcris <- sentences_with_eligcris %>%
ungroup() %>%
# take a random sample of 5 sentences
sample_n(8 ) %>%
select(proj_id, closest_text) %>%
mutate (closest_text = paste0("(...) ", closest_text),
# Make "eligible crisis" bold by adding <b> tags
closest_text = gsub("eligible crisis", "<b>eligible crisis</b>", closest_text)
)
# print out sample in a kable
kable(sample_with_eligcris, format = "html",
# Display the table with bold formatting
escape = FALSE,
col.names = c("WB Project ID","excerpt of PDO sentences with 'eligible crisis'")) %>%
kable_styling(full_width = FALSE) | WB Project ID | excerpt of PDO sentences with 'eligible crisis' |
|---|---|
| P178816 | (...) lactating women in the project regions and to respond to an eligible crisisor emergency. |
| P171040 | (...) ; and (c) to provide immediate and effective response to an eligible crisis or emergency |
| P179592 | (...) project target areas in timor-leste, and (b) in case of an eligible crisis or emergency, respond promptly and effectively to it. |
| P174077 | (...) of the recipient's transport sector, and in the event of an eligible crisis or emergency, to providean immediate response to the eligib |
| P173783 | (...) reparedness in the recipient’s territory, and in case of an eligible crisis or emergency, respond promptly and effectively to it. |
| P178419 | (...) and (iii) to provide immediate and effective response to an eligible crisis or emergency. |
| P176152 | (...) reparedness in the recipient’s territory, and in case of an eligible crisis or emergency, respond promptly and effectively to it. |
| P166991 | (...) cture of the recipient’s highway network, and in case of an eligible crisis or emergency, respond promptly and effectively to it. |
Analyzing bigrams: tf-idf
There are advantages and disadvantages to examining the tf-idf of bigrams rather than individual words. Pairs of consecutive words might capture structure that isn’t present when one is just counting single words, and may provide context that makes tokens more understandable. However, the per-bigram counts are also sparser (a typical two-word pair is rarer than either of its component words).
_______
_______
===== !!!!! QUI !!!!! ======
_______
CONCORDANCES - KWIC
A more evolved way to look at the data is to use the KWIC (Key Word In Context) method. This method allows us to see the context in which a word appears in the text. + {Following LADAL Tutorial} + {Following Ben Schmidt, Chp 8.2.3} + https://alvinntnu.github.io/NTNU_ENC2036_LECTURES/corpus-analysis-a-start.html
In the language sciences, concordancing refers to the extraction of words from a given text or texts. Concordances are commonly displayed in the form of keyword-in-context displays (KWICs) where the search term is shown in context, i.e. with preceding and following words.
Concordancing is central to analyses of text and they often represents the first step in more sophisticated analyses of language data, because concordances are extremely valuable for understanding how a word or phrase is used, how often it is used, and in which contexts is used. As concordances allow us to analyze the context in which a word or phrase occurs and provide frequency information about word use, they also enable us to analyze collocations or the collocational profiles of words and phrases (Stefanowitsch 2020, 50–51). Finally, concordances can also be used to extract examples and it is a very common procedure.
— Concordances
- https://ladal.edu.au/textanalysis.html#Concordancing
- https://www.quantumjitter.com/project/deal/
Using
quanteda
file:///Users/luisamimmi/Github/slogan_old/docs/01b_WDR_data-exploration_abstracts.html
# I use again data = pdo_words
pdo_q_corpus <- quanteda::corpus(as.data.frame(projs_train),
docid_field = "id",
text_field = "pdo",
meta = list("pr_name", "boardApprovalFY")
)
# --- example with individual keyword
# Step 1) tokens
pdo_q_tokens <- quanteda::tokens(x = pdo_q_corpus,
remove_punct = TRUE,
remove_symbols = TRUE#,remove_numbers = TRUE
) %>%
quanteda::tokens_tolower() #%>%
#quanteda::tokens_remove(pattern = custom_stop_words) %>%
#quanteda::tokens_remove(pattern = c("project", "development", "bank", "world", "project", "projects"))
# #______ Step 2) kwic (individual exe )
# kwic_pdo_data <- quanteda::kwic(x = pdo_q_tokens, # define text(s)
# # define pattern
# pattern = quanteda::phrase(c("gender", "climate", "sustainab*")),
# # define window size
# window = 5) %>%
# # convert into a data frame
# as_tibble() %>%
# left_join(projs_train, by = c("docname" = "id")) %>%
# # remove superfluous columns
# dplyr::select( 'Year' = boardapprovalFY, 'Prj title' = pr_name, pre, keyword, post) %>%
# # slice_sample( n = 50) %>%
# kbl(align = "c") # %>% kable_styling()
# ____ Step 2) kwic (on vector)
# Iterate `quanteda::kwic` over a vector of tokens | regex-modified-keywords
keywords <- c("gender", "climate", "sustainab*", "conditional*" )
# apply iteratively kwic over a vector of keywords
outputs_key <- map(keywords,
~quanteda::kwic(pdo_q_tokens,
pattern = .x,
window = 5) %>%
as_tibble() %>%
left_join(projs_train, by = c("docname" = "id")) %>%
# remove superfluous columns
dplyr::select( 'Year' = boardapprovalFY, 'Prj title' = pr_name, pre, keyword, post)
)
# # all togetha 3
n = length(keywords)
# check the first element
outputs_key[[1]] %>%
kbl(align = "c")
outputs_key[[2]] %>%
kbl(align = "c")
# this list has no element names
names(outputs_key)— create kwic with phrases | purrr + print + save png
# Iterate `quanteda::kwic` over a vector of phrases/bigrams
keywords_phrase <- c("climate change", "investment climate", "pro-poor",
"gender equality", "maximizing finance", "digital revolution")
# Step 1) tokens
# (done above) -> abs_q_tokens
# Step 2) kwic
# apply iteratively kwic over a vector of bigrams
outputs_bigrams <- map(keywords_phrase,
~quanteda::kwic(x = pdo_q_tokens, # define text(s)
# define pattern
pattern = quanteda::phrase(.x),
# define window size
window = 5) %>%
# convert into a data frame
as_tibble() %>%
left_join(projs_train, by = c("docname" = "id")) %>%
## remove superfluous columns
dplyr::select( 'Year' = boardapprovalFY, 'Prj title' = pr_name, pre, keyword, post)
)
# number ofo cbigrams
n_bi = length(keywords_phrase)
n_bi # 7
# name this list's elements
outputs_bigrams <- outputs_bigrams %>%
set_names(paste0("kwic_", keywords_phrase))
# get rid of empty output dfs in list
outputs_bigrams2 <- outputs_bigrams[sapply(
outputs_bigrams, function(x) dim(x)[1]) > 0] # 4 left!
#or
outputs_bigrams3 <- purrr::keep(outputs_bigrams, ~nrow(.) > 0) # 4 left!
# -------------- print all
# walk + print -
walk(.x = outputs_bigrams2, .f = print)
# -------------- save all -> create multiple tables from a single dataframe and save them as images
# https://stackoverflow.com/questions/69323569/how-to-save-multiple-tables-as-images-using-kable-and-map/69323893#69323893
outputs_bigrams2 %>%
imap(~save_kable(file = paste0('analysis/output/tables/', 'pdo_', .y, '_.png'),
# bs_theme = 'journal',
self_contained = T,
x = kbl(.x, booktabs = T, align = c('l','l', 'c')) %>%
kable_styling()
)
)Acknowledgements
Below are some invaluable resources that I used to learn and implement the NLP techniques in this analysis: + An Introduction to Quantitative Text Analysis for Linguistics
— SOCIAL PROTECTION
WDR 2004 ” Making Services Work for Poor People”